Abstract

In previous work, the author introduced the B-treap, a uniquely represented B-tree analogue, and
proved strong performance guarantees for it. However, the B-treap maintains complex invariants and is
difficult to implement. In this paper we introduce the B-skip-list, which has most of the guarantees
of the B-treap, but is vastly simpler and easier to implement. Like the B-treap, the B-skip-list may
be used to construct strongly history-independent index structures and filesystems; such constructions
reveal no information about the historical sequence of operations that led to the current logical state. For
example, a uniquely represented filesystem would support the deletion of a file in a way that, in a strong
information-theoretic sense, provably removes all evidence that the file ever existed.

Like the B-tree, the B-skip-list has depth $O(\log_B n)$ where $B$ is the block transfer size of the external
memory, uses linear space with high probability, and supports efficient one-dimensional range queries.



1 Introduction 



Uniquely represented data structures represent each logical state with a unique machine state. Because they
have canonical representations, such data structures are strongly history-independent, and thus reveal no
information about the historical sequence of operations that led to the current logical state. This has useful
applications in cases where strict privacy or high security is required, because most computer applications store
a significant amount of information that is hidden from the application interface, sometimes intentionally
but more often not. This information might consist of data left behind in memory or on disk, but can also consist
of much more subtle variations in the state of a structure due to previous actions or the ordering of the actions.
Such unintentionally stored information might be leaked to the wrong people with disastrous consequences.
For example, there are many embarrassing stories about improperly redacted PDFs being released to the
public. Chapter 1 of [9] gives examples of improper redactions involving leaks by the Central Intelligence
Agency in 2000 revealing information about its role in the 1953 overthrow of the Iranian Government, by the
United States Army in 2005 about the accidental shooting of Italian intelligence agent Nicola Calipari in
Iraq, and by the Fédération Internationale de l'Automobile in 2007 of confidential and valuable technical
information on the McLaren and Ferrari racing cars. A more recent example involved the Transportation
Security Administration revealing classified information about airport passenger screening practices in 2009.

To address the concern of releasing historical and potentially private information, various notions of history
independence have been invented along with data structures that support these notions [12, 14, 11, 5, 1].
Unique representation had been studied even earlier [18, 19, 2], in the context of theoretical investigations
into the need for redundancy in efficient data structures.
Unique Representation on a RAM. There has been much recent progress on efficient uniquely represented
data structures in the RAM model. Blelloch and Golovin [3] described a uniquely represented hash table
for the RAM model supporting insertion, deletion and queries in expected constant time, using linear
space and only $O(\log n)$ random bits. They also provided a perfect hashing scheme that allows for $O(1)$
worst-case queries, and efficient uniquely represented data structures for ordered dictionaries and the order 
maintenance problem. Naor et al. [13] developed a second uniquely represented dynamic perfect hash table 
supporting deletions, based on cuckoo hashing. Blelloch et al. [4] developed efficient uniquely represented 
data structures for some common data structures in computational geometry. Several other results, as well 
as a more comprehensive discussion of uniquely represented data structures, may be found in the author's 
doctoral thesis [9]. 

Unique Representation in External Memory Models. Progress in the RAM model suggested that it 
might be possible to construct efficient uniquely represented data structures for organizing information on 
disk, i.e., for external memory (EM) models of computation [21] which account for the fact that modern
computers have a memory hierarchy in which external memory (e.g., a disk drive) is several orders of
magnitude slower than internal memory (e.g., DRAM). For background on the extensive body of work on
conventional data structures in EM models, we refer the interested reader to the excellent book by Vitter [21].
Within this body of work, extendible hash tables and the B-tree and its variants (e.g., the B+-tree and the
B*-tree) play a prominent role. It is worth noting here that the extendible hashing construction of Fagin et
al. [8] is almost uniquely represented, and in fact can be made uniquely represented with some minor
modifications, the most significant of which is to use a uniquely represented hash table [3, 13] to lay out
blocks on disk. However, in this paper we focus on uniquely represented B-tree analogs, which can support
efficient one-dimensional range queries. The first such uniquely represented B-tree analog was the B-treap 
developed by the author [10]. It supports the following operations. 

• insert(x): insert element x.

• delete(x): delete element x.

• lookup(x): determine if x is present, and if so, return a pointer to it.

• range-query(x, y): return all elements between x and y stored in the data structure.

It is easy to associate auxiliary data with the elements, though for simplicity of exposition we will assume
there is no auxiliary data being stored.

The following result, proved in [10], indicates that the B-treap has performance comparable to B-trees, up
to constant factors. Empirical results in [10] suggest that these constants are extremely modest in practice 
(e.g., less than 1.5 factor increase in depth). 

Theorem 1.1 ([10]) There exists a uniquely represented B-treap that stores elements of a fixed size, such
that if the B-treap contains $n$ elements and the block transfer size $B = \Omega((\ln n)^{1/(1-\epsilon)})$ for some
$\epsilon > 0$, then lookup, insert, and delete each touch at most $O(\frac{1}{\epsilon} \log_B n)$ B-treap nodes in expectation, and
range-query touches at most $O(\frac{1}{\epsilon} \log_B n + k/B)$ B-treap nodes in expectation where $k$ is the size of the
output. Furthermore, if $B = O(n^{1/2 - \delta})$ for some $\delta > 0$, then with high probability the B-treap has depth
$O(\frac{1}{\epsilon} \log_B n)$ and requires only linear space to store.

The B-treap thus performs the functions of a B-tree in a uniquely represented manner with remarkably 
low overhead. However, as a practical matter the B-treap is an extremely complicated data structure and 
is difficult to implement. For this reason we introduce the B-skip-list, which has nearly all of the same 
theoretical guarantees, but is conceptually simpler and far easier to implement. As the name suggests, the 
B-skip-list is based on the skip-lists of Pugh [17]. We prove the following guarantees for the B-skip-list. 



Theorem 1.2 There exists a uniquely represented B-skip-list that stores elements of a fixed size, such that
if the B-skip-list contains $n$ elements and the block transfer size is $B$, then lookup, insert, and delete
each touch at most $O(\log_B n)$ external memory blocks in expectation, and range-query touches at most
$O(\log_B n + k/B)$ external memory blocks in expectation where $k$ is the size of the output. Furthermore,
with high probability the B-skip-list requires only linear space to store.

2 Preliminaries

Terminology & Basic Notation. We let X denote the (ordered) universe of elements that we may store. 
Since the B-skip-list will store duplicates of some of the elements, we say that the B-skip-list is comprised of 
nodes which are labelled with the element that they store. Nodes may also contain abstract pointers to other 
nodes, as described in Section 2.2.1. For $n \in \mathbb{Z}$, let $[n]$ denote $\{1, 2, \ldots, n\}$.

2.1 The External Memory Model 

We use a variant of the parallel disk model of Vitter [21] with one processor and one disk, which measures 
performance in terms of disk I/Os. Internal memory is modeled as a 1-D array of data items, as in a RAM. 
External memory is modeled as a large 1-D array of blocks of data items. A block is a sequence of B data 
items, where $B$ is a parameter called the block transfer size. The external memory can read (or write) a single
block of data items to (or from) internal memory during a single I/O. Other parameters include the problem
size, $n$, and the internal memory size $m$, both measured in units of data items. We will assume $m = \omega(B)$,
so that any constant number of blocks can be stored in internal memory. 

2.2 Uniquely Represented Dynamic Memory Allocation 

In this section we show how to perform uniquely represented memory allocation, which allows us to reduce 
the problem of constructing uniquely represented data structures to the problem of constructing data structures 
with a canonical pointer structure. 

2.2.1 Memory Allocation with Fixed Length Keys 

Uniquely represented hash tables [3, 13] can be used as the basis for a uniquely represented memory allocator.
Intuitively, if the nodes of a pointer structure can be labeled with distinct hashable labels in a uniquely 
represented (i.e., strongly history independent) manner, and the pointer structure itself is uniquely represented 
in a pointer based model of computation, then these hash tables provide a way of mapping the pointer 
structure into a one dimensional memory array while preserving unique representation. Pointers are replaced 
by labels, and pointer dereferencing is replaced by hash table lookups. We call such labels abstract pointers. 
We will assume that the data items have distinct hashable labels, which can be used to generate distinct 
hashable labels for the nodes of the B-skip-list in a uniquely represented manner. 

Given these labels, and a data structure composed of nodes and abstract pointers with a uniquely 
represented pointer structure, we can hash the nodes of the data structure into external memory. If we use
distinct random bits for the hash table and everything else, this inflates the expected number of I/Os by
a constant factor. (For B-skip-lists, uniquely represented dynamic perfect hash tables [3, 13] may be an 
attractive option, since in expectation most of the I/Os will involve reads, even in the case of insertions and 
deletions.) 
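
To make the idea of abstract pointers concrete, here is a minimal sketch in which nodes refer to each other by label and every dereference is a table lookup. The names `Label`, `Node`, `LabelStore`, and `walk` are hypothetical and chosen only for this illustration, and a Python dict merely stands in for the uniquely represented hash tables of [3, 13]; a dict is not itself uniquely represented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# A label is (element, level); this naming is an assumption of the sketch.
Label = Tuple[int, int]


@dataclass(frozen=True)
class Node:
    element: int
    next_label: Optional[Label]  # abstract pointer: a label, not an address


class LabelStore:
    """Stand-in for a uniquely represented hash table.

    A Python dict is NOT uniquely represented; it only models the
    interface (store by label, dereference by lookup).
    """

    def __init__(self) -> None:
        self._table: Dict[Label, Node] = {}
        self.lookups = 0  # each dereference costs one hash lookup

    def store(self, label: Label, node: Node) -> None:
        self._table[label] = node

    def deref(self, label: Label) -> Node:
        self.lookups += 1
        return self._table[label]


def walk(store: LabelStore, start: Label) -> List[int]:
    """Follow abstract pointers from `start`, returning the elements visited."""
    out: List[int] = []
    label: Optional[Label] = start
    while label is not None:
        node = store.deref(label)
        out.append(node.element)
        label = node.next_label
    return out


store = LabelStore()
store.store((3, 1), Node(3, (7, 1)))
store.store((7, 1), Node(7, (12, 1)))
store.store((12, 1), Node(12, None))
```

Calling `walk(store, (3, 1))` visits the three nodes in order and performs one lookup per node, which is the constant-factor overhead the paragraph above refers to.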



2.2.2 Memory Allocation with Variable Length Keys 

In our external memory model, we can obtain significantly better performance if we allocate large data objects 
in consecutive regions of the external memory, rather than breaking them up and hashing the pieces to various 
pseudorandom locations. Therefore we would like to perform efficient memory allocation with variable 
length keys. In addition to the undesirable solution of breaking large objects into smaller pieces and using a 
uniquely represented hash table H to allocate each piece directly using H, there are at least two other options. 
First, we could use H as a virtual memory scheme which allocates fixed sized regions of consecutive blocks
to each large data object, and then use another memory allocator to store the object as though its blocks are
the entire memory. A second method, which is more elegant and arguably more suitable for external memory 
models, is to simply store variable sized objects directly in H, as intervals in the memory array. Recently 
Thorup [20] analyzed the running time of such a variant of linear probing. He showed that using sufficiently 
random hash functions, this variant supports operations in time linear in the string length plus a measure of 
the variance of the string lengths, namely the ratio of the second moment of the string lengths to their mean; 
for the cases we are interested in, the latter will be constant. This variant stores a set of strings $S$ (each of
which is assumed to terminate with an end-of-string character) from a universe of strings $\mathcal{S}$ such that each
string $s$

• is stored in a contiguous block of hash table slots

• is preceded in memory by either an end-of-string character or an empty slot

• is stored such that its first character lies between $h(s)$ and the first empty slot after $h(s)$ (i.e., the empty
slot $y$ minimizing $(y - h(s)) \bmod p$ if there are $p$ slots) in the table storing keys $S \setminus \{s\}$, inclusive.

It is straightforward to see that the uniquely represented hash table of Blelloch and Golovin [3] can be easily 
modified to store variable length objects in this way. To prove unique representation, we can reduce this case 
to the case with fixed length keys handled in [3] as follows. Treat each string $s = s_1 \cdots s_k$ of $k$ characters as
$k$ keys $\{s_1, \ldots, s_k\}$ such that each $s_i$ hashes to the same location and each slot prefers $s_i$ slightly to $s_{i+1}$ for
all $i$, but for all distinct strings $s$ and $s'$, each slot either prefers $s_i$ to $s'_j$ for all $i$ and $j$ or each slot prefers
$s'_j$ to $s_i$ for all $i$ and $j$. For the required slot preferences over the keys, we may use lexicographic ordering
for the strings, and derive the relative slot preferences for the characters of each string as described above.
Note that we describe this reduction only for purposes of analysis and to provide insight; we do not suggest
implementing the memory allocator in this manner, i.e., by treating each string as a sequence of keys. The
problem has structure that allows us to work with the strings directly. The running time for the operations is
thus dominated by the displacement^1 and the length of the input string. Hence Thorup's bounds in Theorem
6.1 of [20] apply to our modified table as well. Before we describe the technical result which follows from
this observation, we recall the definition of $k$-universal hash functions [22].

Definition 2.1 ($k$-Universal Hash Functions) A $k$-universal hash function from $X$ to $Y$ is a random function
$h : X \to Y$ drawn from some distribution such that the images of any $k$ keys under $h$ are distributed
uniformly at random. That is, for all distinct $x_1, \ldots, x_k \in X$ and for all $y_1, \ldots, y_k \in Y$,

$$\Pr_h[\forall i \in [k] .\ h(x_i) = y_i] = 1/|Y|^k$$
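
One standard way to meet Definition 2.1 when $X = Y = \mathbb{Z}_p$ for a prime $p$ is to evaluate a random polynomial of degree less than $k$. The sketch below is illustrative only: the function names are hypothetical, and the brute-force `count_matches` exists just to check the definition on a tiny field, not as something one would run in practice.

```python
import itertools
import random
from typing import Callable, Sequence


def poly_hash(coeffs: Sequence[int], p: int) -> Callable[[int], int]:
    """Return h(x) = sum(coeffs[i] * x**i) mod p, evaluated by Horner's rule."""
    def h(x: int) -> int:
        acc = 0
        for a in reversed(coeffs):
            acc = (acc * x + a) % p
        return acc
    return h


def sample_k_universal(k: int, p: int, rng: random.Random) -> Callable[[int], int]:
    """Draw a uniformly random polynomial of degree < k over Z_p (p prime)."""
    return poly_hash([rng.randrange(p) for _ in range(k)], p)


def count_matches(k: int, p: int, xs: Sequence[int], ys: Sequence[int]) -> int:
    """Count the polynomials of degree < k with h(xs[i]) == ys[i] for every i."""
    total = 0
    for coeffs in itertools.product(range(p), repeat=k):
        h = poly_hash(coeffs, p)
        if all(h(x) == y for x, y in zip(xs, ys)):
            total += 1
    return total
```

For $k$ distinct keys, exactly one of the $p^k$ polynomials interpolates any prescribed images, so the probability in Definition 2.1 is $1/p^k$ as required. Mapping a larger universe such as strings into $\mathbb{Z}_p$ is a separate step that this sketch does not address.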

We are now ready to state our result. 

Theorem 2.2 Let $S \subseteq \mathcal{S}$ be a set of variable length strings from universe $\mathcal{S}$. Let length(s) denote the length
of string $s$, including an end-of-string character. Fix a parameter $p$ such that $\sum_{s \in S} \mathrm{length}(s) \le \alpha p$, for
load $\alpha < 0.9$. Let $h_1$ be a 1-universal hash function from $\mathcal{S}$ to $[p]$ and let $h_2$ be a 5-universal hash function
from $[p]$ to $[p]$. Given a hash table $H$ and $s_0 \in \mathcal{S}$, let $\delta_H(s_0, S, h_1, h_2)$ denote the displacement of $s_0$ in $H$ if
$H$ is currently storing $S$ and using $h_2 \circ h_1$ as its hash function from $\mathcal{S}$ to $[p]$. Then there exists a uniquely
represented hash table $H$ such that for all $s_0 \in S$,

$$E[\delta_H(s_0, S, h_1, h_2)] \le \mathrm{length}(s_0) + O\left(\alpha^{1/3} \, \frac{\sum_{s \in S} \mathrm{length}^2(s)}{\sum_{s \in S} \mathrm{length}(s)}\right)$$

and for all $s_0 \notin S$,

$$E[\delta_H(s_0, S, h_1, h_2)] = O\left(\alpha^{1/3} \, \frac{\sum_{s \in S} \mathrm{length}^2(s)}{\sum_{s \in S} \mathrm{length}(s)}\right)$$

where the expectation is taken over $(h_1, h_2)$.

^1 Recall that in a linear probing hash table, the displacement of a key $s$ is the number of slots $s$ must unsuccessfully probe before
finding an empty slot. In other words, if $s$ is inserted into slot $y$ of a hash table with slots $\{0, 1, 2, \ldots, p - 1\}$, its displacement is
$(y - h(s)) \bmod p$, where $h$ is the hash function in use.



Theorem 2.2 only gives bounds on the displacement of $s_0$, rather than the total amount of work needed
to, say, insert $s_0$. The latter quantity is roughly equal to the displacement, plus the length of $s_0$, plus the
number of characters we need to displace (i.e., move forward) to make room for $s_0$. However, this last
quantity is equal to the displacement of a new key $s'_0$ immediately after $s_0$ is inserted, assuming $s'_0$ hashes
to the same location as $s_0$. Thus we can convert the bounds on displacement in Theorem 2.2 into bounds on
running time with only a slight increase in the bound due to $s_0$. It is worth noting that in our application,
$\sum_{s \in S} \mathrm{length}^2(s) / \sum_{s \in S} \mathrm{length}(s)$ will be constant with high probability, and Theorem 2.2 will be used to prove that with the
right hash functions our variable length key memory allocator will take expected time linear in the key length
to insert, delete, or lookup a key.
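
As a small illustration of displacement, the sketch below inserts fixed-length keys into an ordinary linear probing table and reports how far each key lands from its home slot. This is plain, history-dependent linear probing, not the uniquely represented table of [3] and not the variable-length variant analyzed in [20]; the function name and the hand-picked home slots are assumptions made only for this example.

```python
from typing import List, Optional, Tuple


def insert_linear_probing(
    table: List[Optional[int]], key: int, home: int
) -> Tuple[int, int]:
    """Insert `key` probing forward from slot `home`.

    Returns (slot, displacement), where displacement is the number of
    occupied slots probed before an empty one was found.
    """
    p = len(table)
    for d in range(p):
        slot = (home + d) % p
        if table[slot] is None:
            table[slot] = key
            return slot, d
    raise RuntimeError("table is full")


# Home slots are chosen by hand here instead of by a hash function,
# so that the collisions are easy to follow.
table: List[Optional[int]] = [None] * 8
placements = [
    insert_linear_probing(table, key, home)
    for key, home in [(10, 6), (11, 6), (12, 7), (13, 6)]
]
```

The last key has home slot 6 but ends up in slot 1 after wrapping around, so its displacement is $(1 - 6) \bmod 8 = 3$.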

3 B-Skip-Lists 

3.1 Review of Skip Lists 

Skip lists were developed by William Pugh [17] as a simple randomized alternative to balanced binary search
trees. Let $X$ be the set of elements we wish to store. The idea behind skip lists is to assign a random positive
integer level to each element $x \in X$, such that $\Pr[\mathrm{level}(x) = k]$ decreases exponentially in $k$, and for each $k$
such that some element has $\mathrm{level}(x) \ge k$, maintain a list $L_k$ of all elements $x$ with $\mathrm{level}(x) \ge k$. We also
assume there is a special element front in every list that is smaller than all other elements, so as to be at the
front of every list. Note each list $L_k = (x_1, x_2, \ldots, x_t)$ partitions the elements $X$ into contiguous subsets

$$\{\{x : x_i \le x < x_{i+1}\} : 1 \le i < t\} \cup \{\{x : x_t \le x\}\}$$

and the partition induced by $L_{k-1}$ refines^2 the partition induced by $L_k$. Additionally, we maintain pointers
from each element $x \in L_k$ to $x \in L_{k-1}$. Let $x_i$ be the representative of $\{x : x_i \le x < x_{i+1}\}$. When
searching for an element $x$, we start at the highest level list $L_k$ and search for the representative $y_k$ of the
level $k$ partition set containing $x$, then starting from $y_k$ in $L_{k-1}$ search for the representative $y_{k-1}$ of the level
$k - 1$ partition set containing $x$, and so on until reaching the representative $y_1 = x$ of the level one partition
set containing $x$. See Figure 1.

The original skip list paper [17] sets element levels by repeatedly flipping a biased coin up to $(\beta - 1)$
times, for some parameter $\beta$ which bounds the maximum allowable level. The level of the element is then set
to one plus the number of 'tails' seen before the first 'heads'. If all $(\beta - 1)$ coin flips turn up 'tails,'
then the level is set to $\beta$. The distribution over levels is thus parametrized by the probability of tails $q \in (0, 1)$
and a bound $\beta$ on the maximum allowable level. Specifically, it is

$$\Pr[\mathrm{level}(x) = k] = \begin{cases} q^{k-1}(1 - q) & \text{if } 1 \le k < \beta \\ q^{\beta - 1} & \text{if } k = \beta \\ 0 & \text{if } k < 1 \text{ or } k > \beta \end{cases} \qquad (3.1)$$

^2 A partition $\{A_1, \ldots, A_t\}$ of $X$ is said to refine a partition $\{B_1, \ldots, B_r\}$ of $X$ if for all $A_i$ there is a $B_j$ such that $A_i \subseteq B_j$.



Using this distribution with $q = 1/2$ and $\beta = \lceil \log_2(n) \rceil$ to sample the levels independently for each element,
lookups, insertions, deletions, splits, and joins can be implemented in $O(\log n)$ time [17, 16].

Inspecting Figure 1, it is fairly easy to see that the pointer structure of the skip list is uniquely represented
if the levels are generated using a hash function. We require that $\mathrm{level}(\cdot)$ be a deterministic function (which
may be drawn at random from a suitable hash family), so that the levels are not influenced by what other
elements are present, and so they are consistent if an element is deleted and later reinserted.
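
The requirement that levels be a deterministic function of the element can be sketched as follows. The code derives the coin flips of equation 3.1 from a hash of the element, so deleting and reinserting an element reproduces its level. It assumes $q = 1/\gamma$ for a positive integer `gamma` (so `gamma = 2` gives $q = 1/2$), and it uses SHA-256 purely as a convenient stand-in for a hash family; neither the function name nor the choice of SHA-256 comes from the paper.

```python
import hashlib


def level(x: int, gamma: int, beta: int, salt: bytes = b"") -> int:
    """Deterministic level in [1, beta], following equation 3.1 with q = 1/gamma.

    `salt` plays the role of the random choice of hash function; SHA-256 is
    an assumption of this sketch, not the hash family used in the analysis.
    """
    lvl = 1
    while lvl < beta:
        digest = hashlib.sha256(salt + f"{x}:{lvl}".encode()).digest()
        u = int.from_bytes(digest[:8], "big")  # 64 pseudo-random bits
        # "Tails" happens with probability about 1/gamma and promotes x
        # one level; the first "heads" stops the process.
        if u % gamma != 0:
            break
        lvl += 1
    return lvl
```

Because the flip at each level depends only on `x`, the level index, and `salt`, the result never depends on which other elements are present.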






Figure 1: A skip list, with the search path for key 12 illustrated. 



3.2 The B-Skip-List 

We suppose we have a capacity $N$ bounding the maximum size (i.e., number of nodes) of the B-skip-list. Fix
a parameter $\gamma$, whose purpose will be explained below. In practice $\gamma$ will typically be a small multiple of
$B$. The B-skip-list is a "blocked version" of a uniquely represented skip-list on $n$ nodes, where the levels
are chosen from the distribution in equation 3.1 with $q = 1/\gamma$ and $\beta = \lceil \log_\gamma(N) \rceil + 2$. To maintain unique
representation, the distribution should be sampled using a hash function of the elements, so that if an element
is deleted and then reinserted it has the same level. We refer the interested reader to Section 5.1 of [9] for
further details on the construction of uniquely represented skip lists.



The B-Skip-List Partition Invariant. The "blocking" of the uniquely represented skip list is rather simple.
Recall $L_k$ is the list of elements at level $k$ or higher. Let $L_{k+1} = (x_1^{k+1}, x_2^{k+1}, \ldots, x_{|L_{k+1}|}^{k+1})$. For each level $k$, we
partition $L_k$ as

$$\{\{x : x \in L_k,\ x_i^{k+1} \le x < x_{i+1}^{k+1}\} : 1 \le i < |L_{k+1}|\} \cup \{\{x : x \in L_k,\ x_{|L_{k+1}|}^{k+1} \le x\}\}$$

Refer to Figure 2 for an example. Each of these partitions will be a key (in the memory allocator) containing
a list of skip list nodes. A list $\sigma$ of $|\sigma|$ skip list nodes is then stored across at most $\lceil |\sigma|/B \rceil + 1$ contiguous
blocks of external memory. We allow $\sigma$ to be stored starting in the middle of a block. Note that the hash table
underlying the memory allocator does not work over an array of blocks, but rather an array of finer grained
subdivisions of blocks, e.g., over chunks of memory large enough to store a B-skip-list node. We say $\sigma$ spans
the blocks in which it is stored. In the terminology of Section 2.2.2, $\sigma$ is a string of length $|\sigma|$. We call these
constraints on how the B-skip-list is stored the partition invariant. The label to be used to hash a partition
into external memory for purposes of memory allocation can be constructed from the minimum skip list node
in the partition, and the level of the partition. Operations on the B-skip-list can proceed almost exactly as
with the regular skip-list, with two differences. The first difference is that pointers between partitions must be
replaced with abstract pointers (i.e., labels), and pointer dereferencing between partitions must be replaced
with hash lookups, as discussed in Section 2.2 and in more detail in [3, 9]. The second difference is that the
data structure must perform some more work than a standard skip list, in order to maintain the partitions and
the invariants on how they are stored.
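
The partition invariant can be sketched in a few lines: each list $L_k$ is cut into runs that begin at an element of $L_{k+1}$, and a run stored at an arbitrary node offset touches a bounded number of blocks. The helper names, the sentinel value `-1` for front, and the sample levels below are all made up for illustration; the levels are merely chosen so that 7 and 10 form a level-2 partition followed by 23, as in the Figure 2 caption.

```python
import math
from typing import Dict, List


def partitions_at_level(levels: Dict[int, int], k: int) -> List[List[int]]:
    """Split L_k into runs, each starting at an element of L_{k+1}.

    `levels` maps every element to its level. The front sentinel should be
    included with the maximum level so that it opens the first run.
    """
    L_k = sorted(x for x, lv in levels.items() if lv >= k)
    runs: List[List[int]] = []
    for x in L_k:
        if levels[x] > k or not runs:
            runs.append([x])  # x belongs to L_{k+1}: it opens a new partition
        else:
            runs[-1].append(x)
    return runs


def blocks_spanned(offset: int, length: int, B: int) -> int:
    """Blocks of B nodes touched by `length` nodes starting at node `offset`."""
    if length == 0:
        return 0
    first = offset // B
    last = (offset + length - 1) // B
    return last - first + 1


# Hypothetical levels; -1 plays the role of the front element.
levels = {-1: 3, 3: 1, 7: 3, 10: 2, 12: 1, 23: 3, 25: 1}
```

With these levels, level 2 splits into the runs `[-1]`, `[7, 10]`, and `[23]`, while the top level forms a single run. `blocks_spanned` never exceeds `math.ceil(length / B) + 1`, matching the bound stated above for a list that may start in the middle of a block.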




Figure 2: A skip list partitioned into smaller lists. Each partition corresponds to a gray region. Let $(x, k)$
denote the node labeled $x$ in list $L_k$. Then, e.g., $\{(7, 2), (10, 2)\}$ is a partition because it consists of an
element of level greater than two, namely, 7, and all elements between it and the next such element, 23.



The Tuning Parameter $\gamma$. We can now describe the role of $\gamma$. The levels are geometric random variables
with parameter $(1 - 1/\gamma)$ truncated at $\beta$. The size of a partition set is a random variable that is also described
by a geometric random variable with parameter $1/\gamma$ and support $[n]$. As such the latter has expectation of
roughly $\gamma$. The parameter $\gamma$ thus controls the expected size of a partition set, and for this reason $\gamma = \Theta(B)$ is
appropriate in most applications. Since these partition sets are stored as strings over "characters" large enough
to represent a B-skip-list node, the sizes of the partitions, and particularly how many blocks are required to
store them, play an important role in the performance of the B-skip-list. If $\gamma$ is too high, the insertion of an
element may often cause large partition sets to be moved on disk during memory allocation, resulting in poor
performance. If $\gamma$ is too low, searches will take longer due to the increased height, because as $\gamma$ decreases the
height $\lceil \log_\gamma N \rceil + 2$ increases.
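
As a rough worked example of this trade-off, ignoring the truncation of the support to $[n]$, the expected partition size and the height for two hypothetical settings of $\gamma$ with capacity $N = 10^9$ (numbers chosen only for illustration) are:

```latex
\begin{align*}
\mathbb{E}[|S|] &\approx \sum_{j \ge 1} j \cdot \tfrac{1}{\gamma}\left(1 - \tfrac{1}{\gamma}\right)^{j-1} = \gamma, \\
\gamma = 100:\quad \lceil \log_{100} 10^9 \rceil + 2 &= \lceil 4.5 \rceil + 2 = 7, \\
\gamma = 1000:\quad \lceil \log_{1000} 10^9 \rceil + 2 &= 3 + 2 = 5 .
\end{align*}
```

Raising $\gamma$ by a factor of ten removes two levels from the search path here, at the price of partition sets that are ten times larger on average.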



Maintaining the Partitions. When an element $y$ with $\mathrm{level}(y) \ge 2$ is inserted, it splits existing partitions
in two. Specifically, if $(x, k)$ denotes the node labeled $x$ in list $L_k$, and $((x_1, k), (x_2, k), \ldots, (x_t, k))$ is a
partition before $y$ is inserted, and $\mathrm{level}(y) > k$ and $x_i < y < x_{i+1}$ in element order for some $i$, then inserting
$y$ splits the partition into $((x_1, k), (x_2, k), \ldots, (x_i, k))$ and $((y, k), (x_{i+1}, k), (x_{i+2}, k), \ldots, (x_t, k))$. Note
that inserting $y$ causes one such split for each level less than $\mathrm{level}(y)$. Each such split involves deleting the
old partition, generating the new partition lists and memory allocating them.

Similarly, when an element $y$ with $\mathrm{level}(y) \ge 2$ is deleted, its deletion causes $\mathrm{level}(y) - 1$ merges of existing
partitions, in effect appending the two lists of nodes together. That is, if there are partitions
$((x_1, k), (x_2, k), \ldots, (x_i, k))$ and $((y, k), (x_{i+1}, k), (x_{i+2}, k), \ldots, (x_t, k))$, where $x_i$ immediately precedes
$y$ in $L_k$, then deleting $y$ requires us to merge the two partitions into $((x_1, k), (x_2, k), \ldots, (x_t, k))$. To find
all such partitions, it suffices to search for the predecessor of $y$ in the B-skip-list. Thus when deleting $y$, we
should actually search for its immediate predecessor $x$, and use the lowest level instance of $x$, i.e., $(x, 1)$,
to find $y$. The path from the highest level instance of the front element to the predecessor $(x, 1)$ will then
contain elements in all partition sets with elements less than $y$ that need to be merged during its deletion.
These can easily be used to find all the partition sets with elements greater than $y$ that need to be merged
during its deletion, simply by following the skip-list abstract pointers.
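
The split and merge steps described above can be sketched on a single level, with a partition represented as a sorted Python list of elements. The function names are hypothetical, and the sketch leaves out the memory allocation of the resulting lists.

```python
import bisect
from typing import List, Tuple


def split_partition(part: List[int], y: int) -> Tuple[List[int], List[int]]:
    """Split a sorted partition when y (with level above this one) is inserted.

    Elements smaller than y stay in the old partition; y becomes the head
    of a new partition holding the remaining elements.
    """
    i = bisect.bisect_left(part, y)
    if i == 0:
        raise ValueError("y must follow the head of the partition")
    if i < len(part) and part[i] == y:
        raise ValueError("y is already present")
    return part[:i], [y] + part[i:]


def merge_partitions(left: List[int], right: List[int]) -> List[int]:
    """Merge two adjacent partitions when right[0], the deleted element, is removed."""
    return left + right[1:]


before = [7, 10, 15, 20]
left, right = split_partition(before, 12)
```

Inserting and then deleting the same element restores the original partition, which is the behavior unique representation requires at the level of the pointer structure.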

4 Analysis of the B-Skip-List 

We analyze the B-skip-list using the number of I/Os as the cost of an operation. This cost model is common 
in external memory data structures and is justified insofar as disk I/Os are several orders of magnitude slower 
than main memory operations on modern hardware. To analyze the performance of the B-skip-list in this
model, we need to bound its depth (i.e., the number of levels in it), understand the distribution on partition 
set size, and understand how the partition set size distribution impacts the time to perform the hash table 
operations involved in memory allocation in external memory. 

For purposes of analysis, we will assume that $\{\mathrm{level}(x) : x \in X\}$ are independent, where $X$ is the
universe of elements. To prove similar theoretical guarantees without this assumption, note that there exist
sophisticated, efficiently computable hash functions that are $N$-wise independent with high probability; if
used to generate $\{\mathrm{level}(x) : x \in X\}$, they ensure our results will hold with high probability [15, 7].

Bounding the Depth 

We begin with the simple fact that, by construction, the depth is at most $\beta = \lceil \log_\gamma(N) \rceil + 2$. Recall
$q = 1/\gamma$. For $n \ll N$, we can bound the depth as the maximum level of any element via the
union bound. That is, $\Pr[\mathrm{level}(x) \ge k] \le q^{k-1}$, which implies $\Pr[\exists x,\ \mathrm{level}(x) \ge k] \le n \cdot q^{k-1}$, and so
$\Pr[\mathrm{depth} > (c + 1) \log_\gamma(n)] \le n^{-c}$.
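
Spelling out the last step under the simplifying assumption that $k - 1 = (c + 1) \log_\gamma n$ (rounding is ignored here):

```latex
\begin{align*}
\Pr[\exists x : \mathrm{level}(x) \ge k]
  &\le n \cdot q^{k-1} \\
  &= n \cdot \gamma^{-(c+1)\log_\gamma n} \\
  &= n \cdot n^{-(c+1)} = n^{-c}.
\end{align*}
```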

The Distribution on Partition Sizes 

We prove the following exponential tail bound on the distribution in partition set sizes. 

Lemma 4.1 If $\gamma \ge 2$ then for each partition set $S$, $\Pr[|S| \ge \lambda] \le \exp(-\lambda/\gamma)$.

Proof: First we consider a partition set $S$ containing nodes at level $k < \beta$, where $\beta$ is the maximum possible
level. Let $(x, k)$ denote the level $k$ instance of $x$. Let $(y, k)$ be the smallest node in $S$, by element order
(i.e., the order in which $(y, k) < (x, k)$ if $y < x$). Consider the elements $x \in L_k$ greater than or equal to $y$.
Each has $\mathrm{level}(x) \ge k$, and if $\mathrm{level}(x) > k$, then $(x, k)$ and everything greater than it in level $k$ are not in
$S$. Straightforward calculation reveals $\Pr[\mathrm{level}(x) > k \mid \mathrm{level}(x) \ge k] = q$. Alternately, this fact is readily
apparent if we note that each level(x) can be generated by repeatedly flipping a coin with bias $(1 - q)$ until it
comes up heads, truncating after $\beta - 1$ flips if necessary. Since the levels are independent by assumption, and
$|S| \ge \lambda$ implies $(y, k)$ and the succeeding $\lambda - 1$ nodes in level $k$ have not been assigned any level above $k$,

$$\Pr[|S| \ge \lambda] \le (1 - q)^{\lambda} \le \exp(-q\lambda) = \exp(-\lambda/\gamma)$$

where we have used $1 - q \le e^{-q}$ for all $q \in \mathbb{R}$.



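It remains to prove the bound for the unique partition set $S_\beta$ at level $\beta = \lceil \log_\gamma(N) \rceil + 2$. For this, we
note that $\Pr[x \in S_\beta] = q^{\beta - 1}$, and given the events $\{x \in S_\beta\}$ are independent, we can write

$$\begin{aligned}
\Pr[|S_\beta| \ge \lambda] &\le \Pr[\text{there exist } \lambda \text{ elements with level } \beta] \\
&\le n^{\lambda} (1/\gamma)^{\lambda(\lceil \log_\gamma N \rceil + 1)} \\
&\le (n/N)^{\lambda} \cdot \gamma^{-\lambda} \\
&\le \gamma^{-\lambda} \\
&= \exp(-\lambda \ln(\gamma))
\end{aligned}$$

Assuming $\gamma \ge 2$, we see $\ln(\gamma) \ge 1/\gamma$ and thus we obtain the desired bound of $\Pr[|S_\beta| \ge \lambda] \le \exp(-\lambda/\gamma)$
for this set as well. ■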

The Running Time of Memory Allocation 

Lemma 4.1 in conjunction with Theorem 2.2 yields the following performance bound for the memory 
allocator of Section 2.2.2. 



Lemma 4.2 If $\gamma = O(B)$, then the memory allocator operations of inserting, deleting, or looking up a
partition set $S$ in external memory require only $O(|S|/B + 1)$ I/Os in expectation.



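Proof: Let $\mathcal{C}$ be the collection of partition sets. Consider the memory allocator as operating over strings
representing partition sets in which the characters are B-skip-list nodes. Then by Theorem 2.2 the running
time to insert, delete, or lookup $S \in \mathcal{C}$ will be $O\left(|S| + \frac{\sum_{S' \in \mathcal{C}} |S'|^2}{\sum_{S' \in \mathcal{C}} |S'|}\right)$ in expectation. Since this is a bound
on the displacement after $S$ is inserted, it implies that the number of external memory I/Os will be at most
$1/B$ times this quantity, plus one additional I/O to deal with the consideration that the partitions may not be
block aligned, but may start in the middle of a block. We now bound $E\left[\frac{\sum_{S \in \mathcal{C}} |S|^2}{\sum_{S \in \mathcal{C}} |S|}\right]$. First, note that for all
positive real $a_1, \ldots, a_m$, we have $\sum_i a_i^2 \le (\sum_i a_i)^2$. Thus

$$\frac{\sum_{S \in \mathcal{C}} |S|^2}{\sum_{S \in \mathcal{C}} |S|} \le \sum_{S \in \mathcal{C}} |S|$$

and so we take expectations and bound $E\left[\frac{\sum_{S \in \mathcal{C}} |S|^2}{\sum_{S \in \mathcal{C}} |S|}\right]$ by $E[\sum_{S \in \mathcal{C}} |S|]$. Using Lemma 4.1 and
$E[X] = \sum_{x \ge 1} \Pr[X \ge x]$ for nonnegative integer valued random variables $X$, we obtain

$$E[|S|] \le \sum_{x \ge 1} e^{-x/\gamma} = \frac{e^{-1/\gamma}}{1 - e^{-1/\gamma}} = \Theta(\gamma)$$

Supposing $\gamma = O(B)$, the result is that

$$E\left[\frac{\sum_{S \in \mathcal{C}} |S|^2}{\sum_{S \in \mathcal{C}} |S|}\right] = O(B)$$

and so the contribution of external memory I/Os by the $\frac{\sum_{S' \in \mathcal{C}} |S'|^2}{\sum_{S' \in \mathcal{C}} |S'|}$ portion of the displacement bound is
constant. Hence the expected number of external memory I/Os will be $O(|S|/B + 1)$.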
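
For reference, the geometric series behind the bound on $E[|S|]$ can be evaluated and upper bounded as follows, using only the elementary inequality $e^{t} - 1 \ge t$:

```latex
\begin{align*}
\sum_{x \ge 1} e^{-x/\gamma}
  &= \frac{e^{-1/\gamma}}{1 - e^{-1/\gamma}}
   = \frac{1}{e^{1/\gamma} - 1} \\
  &\le \gamma \qquad \text{since } e^{t} - 1 \ge t \text{ for } t = 1/\gamma > 0 .
\end{align*}
```

With $\gamma = O(B)$ this gives $E[|S|]/B = O(1)$, i.e., a partition set occupies a constant number of blocks in expectation.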



Running Time Analysis of the B-Skip-List 

We are almost ready to analyze the running time of the B-skip-list, but first we must introduce some notation
and prove a lemma. Consider a conventional skip-list lookup(y) operation, and suppose the search path is
$((x_1, k_1), \ldots, (x_t, k_t))$, where $(x, k)$ is the instantiation of $x$ at level $k$. We say the path enters level $k$ at $x$, if
the path contains $((x, k + 1), (x, k))$ as a subpath. The skip-list search procedure ensures that each level is
entered at only one place, if at all. Additionally, we say the search path enters level $k$ at partition set $S$ if it
enters level $k$ at $x$ for some $x \in S$. We make the following claim, which allows for some optimization.

Lemma 4.3 During a lookup(y) operation in a B-skip-list, at each level $k$ we need never inspect an instance
$(x, k)$ of element $x > y$ that lies in a different partition from the one the operation's search path entered.

Proof: [sketch] At the maximum possible level $\beta$ there is nothing to prove since there is only one partition
set at level $\beta$. Moreover, we find the least upper bound on $y$ in $L_\beta$. The key to the proof is to observe that
once we have the least upper bound $x_{k+1}$ on $y$ in $L_{k+1}$, if we enter level $k$ at $w_k$ and proceed to search
forward in $L_k$, then after coming to the final element in the partition set containing $w_k$, the very next element
is $(x_{k+1}, k)$, which we have already established as being greater than $y$. This follows from the design of
the partitioning. In this case we simply declare $x_{k+1}$ to be the least upper bound in $L_k$ as well as $L_{k+1}$.
Otherwise, we either find $y$, in which case we are done, or we find an upper bound $x_k$ for $y$ which is smaller
than $x_{k+1}$ and indeed is the least upper bound for $y$ in $L_k$. ■

Using Lemma 4.3 it is straightforward to prove the following result. 

Lemma 4.4 During a lookup operation, at most one partition set from each level is inspected.

We are now ready to prove Theorem 1.2.

Proof of Theorem 1.2: We start by proving the expected cost bound of 0(log^(n)) external memory I/Os 

for lookup, insert, and delete operations. Fix the capacity N, let ^ = B, and let /3 = |^log^(A^)] + 2 be the 
maximum level. Note the number of memory allocator operations involved in maintaining the B-skip-list 
partition invariant during an insertion or deletion can be amortized against the number of memory allocator 
operations of a lookup, up to constant factors. Lemma 4.2 then allows us to amortize - up to constant factors 
- the number of I/Os for an insertion or deletion against that of a lookup, and the cost of a lookup against the 
total number of blocks spanned by all partitions entered by the search path during the lookup. So consider 
a lookup(x) operation. By construction there are $\beta$ levels, and by Lemma 4.4 at most one partition set per
level is inspected. Let $Z_i$ be the random variable equal to the number of external memory blocks spanned by
the partition at level $i$ entered by the search path of the lookup(x) operation. The number of I/Os is then
$O(\sum_{i=1}^{\beta} Z_i)$, for reasons described in the proof of Lemma 4.3. We have already shown in Lemma 4.1 that
for a random partition set $S$, $\Pr[|S| \ge \lambda] \le \exp(-\lambda/\gamma)$. One might be concerned that the distribution of
sizes of partition sets seen on a search path to an arbitrary element might be skewed towards larger sets, and 
thus increase the expected number of I/Os - a data structures analogue of the "waiting time paradox" from 
queueing theory. We show that this is not a problem by analyzing the search path in reverse, starting at x and 
proceeding up the levels and backward in the lists. Note that at any fixed level i, the search path accesses at 
least $z$ blocks of external memory when progressing backwards over $L_i$ only if the level $i$ partition set $S_i$
entered by the search path has at least $(z - 2)B$ elements smaller than or equal to $x$. Using the same argument
as in the proof of Lemma 4.1, we can prove that this occurs with probability at most $\exp(-(z - 2)B/\gamma)$;
essentially, we work our way backwards over $L_i$ until encountering the first element $y$ with $\mathrm{level}(y) > i$. The
key point is that conditioning on the portion of the search path we have seen so far (i.e., from some element $y$
until the end of the search path at $x$) does not tell us anything about the levels of elements smaller than $y$, and
hence does not change the conditional distributions of their levels.* From this we infer

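$$\Pr[Z_i \ge z] \le \exp(-(z - 2)B/\gamma)$$

and thus, using $E[Z] = \sum_{z \ge 1} \Pr[Z \ge z]$ for nonnegative integral random variables $Z$, and bounding the terms for $z \in \{1, 2\}$ by one, we obtain

$$E[Z_i] \le 2 + \sum_{z \ge 3} \exp(-(z - 2)B/\gamma) = 2 + \frac{1}{\exp(B/\gamma) - 1}.$$

Linearity of expectation then yields the final bound of

$$E\left[\sum_{i=1}^{\beta} Z_i\right] \le \left(2 + \frac{1}{\exp(B/\gamma) - 1}\right)\left(\lceil \log_\gamma(N) \rceil + 2\right).$$

Hence setting $\gamma = B$ gives a bound of $\frac{2e - 1}{e - 1}\left(\lceil \log_B(N) \rceil + 2\right)$. The bound for range-query(x, y) follows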
easily, as we can simply lookup(x) and then scan the lowest level list $L_1$ starting from $x$ until we reach $y$,
which involves reading from at most an additional $k/B + 2$ blocks of external memory, and hence $O(k/B)$
more I/Os in expectation.
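
The expectation bound can be checked numerically. The sketch below assumes only the tail bound $\Pr[Z_i \ge z] \le \exp(-(z - 2)B/\gamma)$ from the proof, caps each tail probability at one, and sums the identity $E[Z] = \sum_{z \ge 1} \Pr[Z \ge z]$. The function names and the truncation at 200 terms are illustrative choices. The closed form $2 + 1/(\exp(B/\gamma) - 1)$ is simply what the geometric series evaluates to under that assumption.

```python
import math


def tail_bound(z, B, gamma):
    """Upper bound on Pr[Z_i >= z], capped at 1 since it is a probability."""
    return min(1.0, math.exp(-(z - 2) * B / gamma))


def per_level_bound(B, gamma, terms=200):
    """Bound E[Z_i] by summing the tail bounds over z = 1..terms.

    The truncation is adequate when B/gamma is not tiny, e.g. gamma = B.
    """
    return sum(tail_bound(z, B, gamma) for z in range(1, terms + 1))


def closed_form(B, gamma):
    """Value of the full series: 2 + sum_{j>=1} exp(-j * B / gamma)."""
    return 2.0 + 1.0 / (math.exp(B / gamma) - 1.0)


def lookup_bound(N, B, gamma):
    """Per-level bound times the number of levels, ceil(log_gamma N) + 2."""
    beta = math.ceil(math.log(N, gamma)) + 2
    return beta * closed_form(B, gamma)


if __name__ == "__main__":
    B = gamma = 64
    print(per_level_bound(B, gamma), closed_form(B, gamma))
    print(lookup_bound(10**6, B, gamma))
```

With $\gamma = B$ the per-level value is independent of $B$, so the expected number of blocks read per lookup in this bound grows only with the number of levels.
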

We next show a bound on the space required to store the B-skip-list. The variable length key hash
table which we use for memory allocation is guaranteed to require only space linear in the number of
B-skip-list nodes [9]. Hence, it suffices to show that the number of B-skip-list nodes is linear in the
number of list elements. Since each element $x$ is duplicated $\mathrm{level}(x)$ times, it suffices to bound $E[\mathrm{level}(x)]$.
Note that $\mathrm{level}(x)$ is stochastically dominated by a geometric random variable with success parameter
$1 - q = 1 - 1/\gamma$, and hence has mean at most $1/(1 - q) = \gamma/(\gamma - 1)$. Additionally, by the independence of
$\{\mathrm{level}(x) : x \in X\}$ we can apply standard concentration inequalities such as Chernoff bounds [6] to show
that $\sum_{x \in X} \mathrm{level}(x) \le n\gamma/(\gamma - 1) + O(\sqrt{n \log n})$ with high probability. Hence the number of B-skip-list
nodes is at most $n\gamma/(\gamma - 1) + O(\sqrt{n \log n})$ with high probability, and the space required will be linear in the number
of elements $n$. ■
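
The space bound can likewise be illustrated by simulation. The sketch assumes each element's level is an independent geometric variable with promotion probability $1/\gamma$, truncated at $\beta = \lceil \log_\gamma(n) \rceil + 2$, and drawn from a seeded pseudorandom generator. A uniquely represented implementation would derive levels from a hash function instead. The function names are hypothetical, and the constant 1 in front of the $\sqrt{n \log n}$ term in the usage below is an arbitrary illustrative choice.

```python
import math
import random


def sample_level(rng, gamma, beta):
    """Draw a level in 1..beta: keep promoting with probability 1/gamma."""
    k = 1
    while k < beta and rng.random() < 1.0 / gamma:
        k += 1
    return k


def total_nodes(n, gamma, seed=0):
    """Total number of nodes: each element appears once per level it occupies."""
    beta = math.ceil(math.log(n, gamma)) + 2
    rng = random.Random(seed)  # assumption: stands in for hashing the keys
    return sum(sample_level(rng, gamma, beta) for _ in range(n))


if __name__ == "__main__":
    n, gamma = 200_000, 8
    nodes = total_nodes(n, gamma)
    print("nodes per element:", nodes / n)
    print("gamma / (gamma - 1):", gamma / (gamma - 1))
    print("within n*gamma/(gamma-1) + sqrt(n log n):",
          nodes <= n * gamma / (gamma - 1) + math.sqrt(n * math.log(n)))
```
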

5 Conclusions 

In this paper we introduced the B-skip-list, a uniquely represented data structure analogous to a B-tree which 
is suitable for use in uniquely represented (and hence strongly history independent) filesystems and databases. 
The B-skip-list supports the same operations as a B-tree, and with similar performance guarantees; it supports 
insertion, deletion, and lookups in $O(\log_B(n))$ external memory I/Os, where $B$ is the block size, supports
efficient range queries, and is space efficient. The only previous uniquely represented data structure with 
these properties is the B-treap [10], which is significantly more complicated and more difficult to implement. 
Hence the principal virtue of the B-skip-list is its relative simplicity and elegance, which it inherits from the 
elegant skip lists of Pugh [17] and the uniquely represented hash table of Blelloch & Golovin [3, 9]. 
* Note this would not be the case if we tried to analyze the search path in the "forward" temporal direction, rather than using
backwards analysis.